Analysis of Billboard lyrics 1965-2015

Nils Indreiten
10/09/2021

This analysis was inspired by part 5 of Julia Silge’s Text Mining with Tidy Data Principles course. The dataset was curated by Kaylin Pavlik and contains songs that appeared on Billboard’s Year-End Hot 100 throughout five decades.

Tidying song lyrics

The variables in this dataset are the following:

- rank: the rank a song achieved on the Billboard Year-End Hot 100
- song: the song’s title
- artist: the artist who recorded the song
- year: the year the song reached the given rank on the Billboard chart
- lyrics: the lyrics of the song

The dataset consists of more than 5,000 songs, spanning from 1965 to 2015. The lyrics are stored in a single column, so we need to convert them into tidy format. We can do so using the unnest_tokens() function, tokenising the lyrics and creating a new word column:

tidy_lyrics <- lyrics %>% 
  # tokenise the Lyrics column into one word per row
  unnest_tokens(word, Lyrics)

head(tidy_lyrics) %>% kable()
Rank Song Artist Year Source word
1 wooly bully sam the sham and the pharaohs 1965 3 sam
1 wooly bully sam the sham and the pharaohs 1965 3 the
1 wooly bully sam the sham and the pharaohs 1965 3 sham
1 wooly bully sam the sham and the pharaohs 1965 3 miscellaneous
1 wooly bully sam the sham and the pharaohs 1965 3 wooly
1 wooly bully sam the sham and the pharaohs 1965 3 bully
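As a self-contained sketch of what unnest_tokens() is doing here (a toy line of lyrics, not the real dataset):

```r
library(dplyr)
library(tidytext)

# toy data standing in for the real lyrics table
toy <- tibble::tibble(Song = "wipe out", Lyrics = "Wipe out ha ha ha")

# one row per token; unnest_tokens() lower-cases and strips punctuation by default
toy %>% unnest_tokens(word, Lyrics)
```

Each word becomes its own row, which is what the counting steps below rely on.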

Data exploration

We might be interested in what the most common words in the song lyrics are. Words like ‘you’, ‘i’ and ‘the’ top the list; at the other end, words like ‘zoomed’, ‘zucchinis’ and ‘zwei’ each appear only once:

tidy_lyrics %>% 
  count(word, sort=TRUE) %>% head()
  word     n
1  you 64606
2    i 56472
3  the 53451
4   to 35752
5  and 32555
6   me 31170
tidy_lyrics %>% 
  count(word, sort=TRUE) %>% tail()
           word n
42152    zoomed 1
42153     zooms 1
42154    zooped 1
42155 zucchinis 1
42156      zulu 1
42157      zwei 1
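Most of the top words above are stop words (‘you’, ‘the’, ‘and’), which is why the modelling step later removes them with get_stopwords(). A minimal sketch of that step on hypothetical toy counts:

```r
library(dplyr)
library(tidytext)

# hypothetical word counts (not from the dataset)
toy_counts <- tibble::tibble(
  word = c("you", "the", "love", "baby"),
  n    = c(100, 90, 40, 30)
)

# anti_join() keeps only the rows whose word is NOT in the stop-word lexicon
toy_counts %>% anti_join(get_stopwords(), by = "word")
```

Only ‘love’ and ‘baby’ survive; ‘you’ and ‘the’ match the stop-word lexicon and are dropped.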

The number of words per song appears to be positively correlated with release year: as the years go by, songs tend to contain more words.

tidy_lyrics %>% 
  count(Year,Song) %>% 
  filter(n>1) %>% # keep only songs with more than one word
  ggplot(aes(Year,n))+
  geom_point(alpha=0.4, size=5,color="orange")+
  geom_smooth(method="lm", color="black")+
  theme_minimal()
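To put a number on this trend, one could compute the correlation between year and word count directly; a sketch using the tidy_lyrics object from above:

```r
# correlation between release year and words per song
tidy_lyrics %>% 
  count(Year, Song) %>% 
  summarise(correlation = cor(Year, n))
```

A positive value would confirm the upward slope the linear fit shows in the plot.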

Let’s try to figure out which songs have very few or very many words:

tidy_lyrics %>% 
  count(Year, Song) %>% 
  arrange(-n) %>%  # change to arrange(n) to display the songs with the fewest words 
head()
  Year                  Song    n
1 2007            im a flirt 1156
2 1998 been around the world 1149
3 2009               forever 1050
4 2010               forever 1050
5 2003        air force ones 1042
6 1988         dont be cruel 1038

We can extract a song of interest using filter(); in this case we filter for “wipe out” by The Surfaris:

lyrics %>% 
  filter(Song == "wipe out") 
  Rank     Song       Artist Year              Lyrics Source
1   63 wipe out the surfaris 1966  wipe out ha ha ha       1

Pop Vocab over the decades

In order to explore the evolution of pop song vocabulary over the decades we can build some linear models. The first step is to create a dataset of word counts: we count how often each word is used in each year, group the data by year, and create a new column containing the total words used that year. Finally, we filter the dataset to only include words with more than 500 total uses, as we don’t want to train models on words that are used sparingly:

word_counts <- tidy_lyrics %>% 
  anti_join(get_stopwords()) %>% 
  count(Year, word) %>% 
  # group by `year`
  group_by(Year) %>%
  # create a new column for the total words per year
  mutate(year_total = sum(n)) %>% 
  ungroup() %>% 
  # now group by `word`
  group_by(word) %>% 
  # keep only words used more than 500 times
  filter(sum(n) > 500) %>% 
  ungroup()

word_counts
# A tibble: 14,791 × 4
    Year word         n year_total
   <int> <chr>    <int>      <int>
 1  1965 ah          30       9845
 2  1965 aint        47       9845
 3  1965 alone        5       9845
 4  1965 alright      7       9845
 5  1965 always      14       9845
 6  1965 another     12       9845
 7  1965 anything     2       9845
 8  1965 arms        18       9845
 9  1965 around      21       9845
10  1965 away        30       9845
# … with 14,781 more rows

Now that we have our dataset, we can use it to train many models, one per word. The broom package enables us to handle the model output. We create list columns by nesting the word count data by word, then use mutate() to create a new column for the models, training a binomial model for each word in which the number of successes (word counts) and failures (total counts per year) are predicted by year:

library(broom)

slopes <- word_counts %>%
  nest_by(word) %>%
  # create a new column for our `model` objects
  mutate(model = list(glm(cbind(n, year_total) ~ Year, 
                          family = "binomial", data = data))) %>%
  summarize(tidy(model)) %>%
  ungroup() %>%
  # filter to only keep the "year" terms
  filter(term == "Year") %>%
  mutate(p.value = p.adjust(p.value)) %>%
  arrange(estimate)

slopes
# A tibble: 297 × 6
   word    term  estimate std.error statistic  p.value
   <chr>   <chr>    <dbl>     <dbl>     <dbl>    <dbl>
 1 woman   Year   -0.0446   0.00254     -17.5 1.93e-66
 2 lovin   Year   -0.0440   0.00289     -15.2 8.70e-50
 3 morning Year   -0.0409   0.00308     -13.3 7.07e-38
 4 sweet   Year   -0.0397   0.00207     -19.1 3.56e-79
 5 easy    Year   -0.0377   0.00310     -12.2 1.48e-31
 6 loves   Year   -0.0337   0.00304     -11.1 3.13e-26
 7 lonely  Year   -0.0309   0.00252     -12.2 4.63e-32
 8 help    Year   -0.0297   0.00253     -11.8 1.59e-29
 9 old     Year   -0.0280   0.00255     -11.0 1.41e-25
10 people  Year   -0.0275   0.00221     -12.4 3.97e-33
# … with 287 more rows
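To make the per-word model concrete, here is the same binomial specification fitted to hypothetical counts for a single word (toy numbers, not drawn from the dataset):

```r
library(broom)

# hypothetical counts for one word across five years
toy <- data.frame(
  Year       = c(1970, 1980, 1990, 2000, 2010),
  n          = c(50, 40, 30, 20, 10),
  year_total = rep(1000, 5)
)

# successes = word count, failures = total words that year,
# matching the cbind(n, year_total) specification used above
fit <- glm(cbind(n, year_total) ~ Year, family = "binomial", data = toy)
tidy(fit)
```

For these declining counts the Year estimate is negative, which is the same pattern the top rows of slopes show for words like ‘woman’ and ‘lovin’.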

We can use a volcano plot to visualise all the models we trained. This type of plot allows us to compare the effect size (estimate) and statistical significance (p-value) of each word at a glance:

library(plotly)
p <- slopes %>% 
  ggplot(aes(estimate, p.value, label=word))+
  geom_vline(xintercept = 0, lty=3, size=1.5, alpha=0.7, color="gray50")+
  geom_point(color="pink", alpha=0.5, size=2.5)+
  scale_y_log10()+
  theme_light()

ggplotly(p)

Session Info

R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
 [1] plotly_4.9.4.1  broom_0.7.9     knitr_1.33      tidytext_0.3.1 
 [5] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4    
 [9] readr_2.0.1     tidyr_1.1.3     tibble_3.1.4    ggplot2_3.3.5  
[13] tidyverse_1.3.1

loaded via a namespace (and not attached):
 [1] httr_1.4.2        sass_0.4.0        viridisLite_0.4.0
 [4] jsonlite_1.7.2    splines_4.1.0     modelr_0.1.8     
 [7] bslib_0.3.0       assertthat_0.2.1  highr_0.9        
[10] cellranger_1.1.0  yaml_2.2.1        pillar_1.6.2     
[13] backports_1.2.1   lattice_0.20-44   glue_1.4.2       
[16] digest_0.6.27     rvest_1.0.1       colorspace_2.0-2 
[19] htmltools_0.5.2   Matrix_1.3-4      pkgconfig_2.0.3  
[22] haven_2.4.3       scales_1.1.1      distill_1.2      
[25] tzdb_0.1.2        downlit_0.2.1     mgcv_1.8-36      
[28] generics_0.1.0    farver_2.1.0      ellipsis_0.3.2   
[31] pacman_0.5.1      withr_2.4.2       lazyeval_0.2.2   
[34] cli_3.0.1         magrittr_2.0.1    crayon_1.4.1     
[37] readxl_1.3.1      evaluate_0.14     stopwords_2.2    
[40] tokenizers_0.2.1  janeaustenr_0.1.5 fs_1.5.0         
[43] fansi_0.5.0       nlme_3.1-152      SnowballC_0.7.0  
[46] xml2_1.3.2        data.table_1.14.0 tools_4.1.0      
[49] hms_1.1.0         lifecycle_1.0.0   munsell_0.5.0    
[52] reprex_2.0.1      compiler_4.1.0    jquerylib_0.1.4  
[55] rlang_0.4.11      grid_4.1.0        rstudioapi_0.13  
[58] htmlwidgets_1.5.3 crosstalk_1.1.1   labeling_0.4.2   
[61] rmarkdown_2.10    gtable_0.3.0      DBI_1.1.1        
[64] R6_2.5.1          lubridate_1.7.10  fastmap_1.1.0    
[67] utf8_1.2.2        stringi_1.7.4     Rcpp_1.0.7       
[70] vctrs_0.3.8       dbplyr_2.1.1      tidyselect_1.1.1 
[73] xfun_0.25